Efficient binary embedding of categorical data using BinSketch
Authors
Abstract
In this work, we present a dimensionality reduction algorithm, also known as sketching, for categorical datasets. Our proposed sketching algorithm, Cabin, constructs low-dimensional binary sketches from high-dimensional vectors, and our distance estimation algorithm, Cham, computes a close approximation of the Hamming distance between any two original vectors from only their sketches. The minimum sketch dimension required by Cham to ensure a good approximation theoretically depends on the sparsity of the data points, making it useful for many real-life scenarios involving sparse data. We present a rigorous theoretical analysis of our approach and supplement it with extensive experiments on several real-world data sets, including one with over a million dimensions. We show that the Cabin and Cham duo is significantly fast and accurate for tasks such as RMSE, all-pair similarity, and clustering when compared with working on the full dataset and with other techniques.
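The abstract does not spell out how the sketches are built or how the distance is estimated, so the following is only a minimal illustrative sketch and not the paper's Cabin or Cham. It assumes binary input vectors (a categorical vector would first need some binary encoding, which is not shown), a random assignment of original coordinates to sketch buckets with a bitwise OR inside each bucket, and an occupancy-based estimator for set sizes. All function names (`make_mapping`, `sketch`, `estimate_support`, `estimate_hamming`) and all parameter values are hypothetical.

```python
import math
import random


def make_mapping(d, n, seed=0):
    """Assign each of the d original coordinates to one of n sketch buckets."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(d)]


def sketch(nonzero_indices, mapping, n):
    """Build an n-bit sketch: a bucket is 1 if any nonzero coordinate maps to it."""
    s = [0] * n
    for i in nonzero_indices:
        s[mapping[i]] = 1
    return s


def estimate_support(ones, n):
    """Estimate how many nonzero coordinates produced `ones` set buckets.

    Inverts E[ones] = n * (1 - (1 - 1/n)**k) for k.
    """
    ones = min(ones, n - 1)  # avoid log(0) when the sketch is saturated
    return math.log(1 - ones / n) / math.log(1 - 1 / n)


def estimate_hamming(sa, sb):
    """Estimate the Hamming distance of two binary vectors from their sketches."""
    n = len(sa)
    na = estimate_support(sum(sa), n)
    nb = estimate_support(sum(sb), n)
    # The OR of two sketches equals the sketch of the union of the supports.
    n_union = estimate_support(sum(x | y for x, y in zip(sa, sb)), n)
    # For binary vectors: |a xor b| = 2 * |a union b| - |a| - |b|
    return max(0.0, 2 * n_union - na - nb)


# Assumed toy setup: 10,000 original dimensions, 2,000 sketch bits.
d, n = 10_000, 2_000
mapping = make_mapping(d, n, seed=1)
a = set(range(0, 200))   # nonzero positions of the first sparse vector
b = set(range(50, 250))  # nonzero positions of the second sparse vector
est = estimate_hamming(sketch(a, mapping, n), sketch(b, mapping, n))
print(f"true Hamming distance: {len(a ^ b)}, estimate: {est:.1f}")
```

In this toy setup the sketch length is chosen relative to the number of nonzero entries rather than the original dimension, which mirrors the abstract's point that the required sketch size depends on sparsity.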
Similar resources
Fast Binary Embedding for High-Dimensional Data
Binary embedding of high-dimensional data requires long codes to preserve the discriminative power of the input space. Traditional binary coding methods often suffer from very high computation and storage costs in such a scenario. To address this problem, we propose two solutions which improve over existing approaches. The first method, Bilinear Binary Embedding (BBE), converts high-dimensional ...
Efficient manipulation of binary data using pattern matching
Pattern matching is an important operation in functional programs. So far, pattern matching has been investigated in the context of structured terms. This article presents an approach to extend pattern matching to terms without (much of a) structure such as binaries which is the kind of data format that network applications typically manipulate. After introducing the binary datatype and a notat...
On Binary Embedding using Circulant Matrices
Binary embeddings provide efficient and powerful ways to perform operations on large-scale data. However, binary embedding typically requires long codes in order to preserve the discriminative power of the input space. Thus, binary coding methods traditionally suffer from high computation and storage costs in such a scenario. To address this problem, we propose Circulant Binary Embedding (CBE) wh...
Efficient Adaptive Data Compression Using Fano Binary Search Trees
In this paper, we show an effective way of using adaptive self-organizing data structures in enhancing compression schemes. We introduce a new data structure, the Partitioning Binary Search Tree (PBST), which is based on the well-known Binary Search Tree (BST), and when used in conjunction with Fano encoding, the PBST leads to the so-called Fano Binary Search Tree (FBST). The PBST and FBST can ...
Application of the Rasch Model in Categorical Pedigree Analysis Using MCEM: I Binary Data
An extension of the Rasch model with correlated latent variables is proposed to model correlated binary data within families. The latent variables have the classical correlation structure of Fisher (1918) and the model parameters thus have genetic interpretations. The proposed model is fitted to data using a hybrid of the Metropolis-Hastings algorithm and the MCEM modification of the EM-algorit...
Journal
Journal title: Data Mining and Knowledge Discovery
Year: 2022
ISSN: 1573-756X, 1384-5810
DOI: https://doi.org/10.1007/s10618-021-00815-y